
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Corpus Data &#8212; PyCantonese 2.2.0 documentation</title>
    <link rel="stylesheet" href="_static/sphinxdoc.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <script type="text/javascript" id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <link rel="index" title="Index" href="genindex.html" />
    <link rel="search" title="Search" href="search.html" />
    <link rel="next" title="Stop Words" href="stop_words.html" />
    <link rel="prev" title="Download and Install" href="download.html" /> 
  </head><body>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="stop_words.html" title="Stop Words"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="download.html" title="Download and Install"
             accesskey="P">previous</a> |</li>
        <li class="nav-item nav-item-0"><a href="index.html">PyCantonese 2.2.0 documentation</a> &#187;</li> 
      </ul>
    </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">
  <h3><a href="index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">Corpus Data</a><ul>
<li><a class="reference internal" href="#the-chat-transcription-format">The CHAT Transcription Format</a></li>
<li><a class="reference internal" href="#accessing-built-in-data">Accessing Built-in Data</a></li>
<li><a class="reference internal" href="#accessing-custom-data">Accessing Custom Data</a></li>
</ul>
</li>
</ul>

  <h4>Previous topic</h4>
  <p class="topless"><a href="download.html"
                        title="previous chapter">Download and Install</a></p>
  <h4>Next topic</h4>
  <p class="topless"><a href="stop_words.html"
                        title="next chapter">Stop Words</a></p>
        </div>
      </div>

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body" role="main">
            
  <div class="section" id="corpus-data">
<span id="data"></span><h1>Corpus Data<a class="headerlink" href="#corpus-data" title="Permalink to this headline">¶</a></h1>
<ul class="simple">
<li><a class="reference internal" href="#chat-format"><span class="std std-ref">The CHAT Transcription Format</span></a></li>
<li><a class="reference internal" href="#built-in-data"><span class="std std-ref">Accessing Built-in Data</span></a></li>
<li><a class="reference internal" href="#custom-data"><span class="std std-ref">Accessing Custom Data</span></a></li>
</ul>
<div class="section" id="the-chat-transcription-format">
<span id="chat-format"></span><h2>The CHAT Transcription Format<a class="headerlink" href="#the-chat-transcription-format" title="Permalink to this headline">¶</a></h2>
<p>PyCantonese adopts the CHAT format (as used in the CHILDES database for
language acquisition research) as its standard corpus format.
CHAT was chosen because it is widely used, well documented,
and supports rich linguistic annotation.</p>
<p>All built-in corpus datasets of PyCantonese are in the CHAT format.
Under the hood, PyCantonese uses the Python library
<a class="reference external" href="http://pylangacq.org/">PyLangAcq</a> to parse CHAT data files.
For the bare minimum of the CHAT format that PyCantonese assumes,
see the <a class="reference external" href="http://pylangacq.org/read.html#chat-format">PyLangAcq notes on the CHAT format</a>.</p>
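<p>As a rough orientation, the sketch below classifies the lines of a tiny CHAT-style transcript by their leading character: <code class="docutils literal notranslate"><span class="pre">@</span></code> for headers, <code class="docutils literal notranslate"><span class="pre">*</span></code> for utterances, and <code class="docutils literal notranslate"><span class="pre">%</span></code> for dependent tiers. The transcript content, the speaker code <code class="docutils literal notranslate"><span class="pre">XXA</span></code>, and the helper <code class="docutils literal notranslate"><span class="pre">classify_chat_line()</span></code> are made up for illustration. This is not how PyLangAcq parses files, and it uses only the standard library.</p>

```python
# Illustrative only: a tiny CHAT-style transcript. The speaker code, the
# utterance, and the comment tier are invented for this sketch.
SAMPLE_CHAT = """@UTF8
@Begin
@Languages:\tyue
@Participants:\tXXA Speaker
*XXA:\tnei5 hou2 .
%com:\tillustrative comment tier
@End
"""


def classify_chat_line(line):
    """Label a CHAT line by its leading character (hypothetical helper)."""
    if line.startswith("@"):
        return "header"
    if line.startswith("*"):
        return "utterance"
    if line.startswith("%"):
        return "dependent tier"
    # In CHAT, a line starting with a tab continues the previous line.
    return "continuation"


lines = SAMPLE_CHAT.splitlines()
labels = [classify_chat_line(line) for line in lines]

for label, line in zip(labels, lines):
    print(label.ljust(15), line)
```

<p>A real CHAT file carries more headers and annotation tiers than this; the PyLangAcq documentation linked above describes what is actually required.</p>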
</div>
<div class="section" id="accessing-built-in-data">
<span id="built-in-data"></span><h2>Accessing Built-in Data<a class="headerlink" href="#accessing-built-in-data" title="Permalink to this headline">¶</a></h2>
<p>PyCantonese currently comes with one built-in Cantonese corpus, the
150,000-word <a class="reference external" href="http://compling.hss.ntu.edu.sg/hkcancor/">Hong
Kong Cantonese Corpus</a> (HKCanCor)
by Kang Kwong Luke. It is loaded with <code class="docutils literal notranslate"><span class="pre">hkcancor()</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">pycantonese</span> <span class="kn">as</span> <span class="nn">pc</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">corpus</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">hkcancor</span><span class="p">()</span>
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p>HKCanCor is released under a CC BY license.
If this corpus is used, the following should be cited:</p>
<p class="last">K. K. Luke and May L.Y. Wong (2015) The Hong Kong Cantonese Corpus:
Design and Uses. <em>Journal of Chinese Linguistics</em> (to appear).</p>
</div>
<p>For details on the CHAT version of HKCanCor bundled with PyCantonese,
see this
<a class="reference external" href="https://github.com/pycantonese/pycantonese/blob/master/pycantonese/data/hkcancor/readme.md">readme</a>.</p>
</div>
<div class="section" id="accessing-custom-data">
<span id="custom-data"></span><h2>Accessing Custom Data<a class="headerlink" href="#accessing-custom-data" title="Permalink to this headline">¶</a></h2>
<p>If you have a Cantonese corpus in the CHAT format on your local drive and would
like to work with it in PyCantonese, use the function <code class="docutils literal notranslate"><span class="pre">read_chat()</span></code>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">pycantonese</span> <span class="kn">as</span> <span class="nn">pc</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">corpus</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">read_chat</span><span class="p">(</span><span class="s1">&#39;path/to/files/*.cha&#39;</span><span class="p">)</span>
</pre></div>
</div>
<p>If your CHAT data files have the extension <code class="docutils literal notranslate"><span class="pre">.cha</span></code> and are all in
a single directory, the wildcard <code class="docutils literal notranslate"><span class="pre">*</span></code> in the path
matches all CHAT files in that directory.</p>
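<p>To see which files a pattern such as <code class="docutils literal notranslate"><span class="pre">*.cha</span></code> would select, the following sketch uses the standard library's <code class="docutils literal notranslate"><span class="pre">glob</span></code> module on a throwaway directory. The file names are hypothetical, and it is an assumption that <code class="docutils literal notranslate"><span class="pre">read_chat()</span></code> expands patterns with exactly these semantics; the sketch only illustrates shell-style matching.</p>

```python
import glob
import os
import tempfile

# Hypothetical file names, created in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as directory:
    os.makedirs(os.path.join(directory, "sub"))
    names = ("session1.cha", "session2.cha", "notes.txt",
             os.path.join("sub", "session3.cha"))
    for name in names:
        with open(os.path.join(directory, name), "w", encoding="utf8") as f:
            f.write("@UTF8\n@Begin\n@End\n")

    # '*' matches any file name in this one directory; it does not
    # descend into subdirectories.
    pattern = os.path.join(directory, "*.cha")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)  # ['session1.cha', 'session2.cha']
```

<p>Checking a pattern this way before passing it to <code class="docutils literal notranslate"><span class="pre">read_chat()</span></code> can help confirm that it picks up the intended files.</p>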
<p><code class="docutils literal notranslate"><span class="pre">read_chat()</span></code> has the optional parameter <code class="docutils literal notranslate"><span class="pre">encoding</span></code>, which defaults to
<code class="docutils literal notranslate"><span class="pre">utf8</span></code> (UTF-8) and can be set to another encoding if necessary.</p>
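<p>The sketch below shows why the encoding matters, using only the standard library. A made-up utterance line is saved in Big5, then read back once with UTF-8 (which fails) and once with Big5 (which succeeds). The file name, the speaker code, and the choice of Big5 are assumptions for illustration; with PyCantonese, the analogous step would presumably be passing the same codec name through the <code class="docutils literal notranslate"><span class="pre">encoding</span></code> parameter.</p>

```python
import os
import tempfile

# Made-up utterance line containing Chinese characters.
text = "*XXA:\t你好 .\n"

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "sample.cha")  # hypothetical file name

    # Save the file in Big5 rather than UTF-8.
    with open(path, "w", encoding="big5") as f:
        f.write(text)

    # Decoding with the wrong codec raises an error...
    try:
        with open(path, encoding="utf8") as f:
            f.read()
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # ...while the matching codec recovers the original text.
    with open(path, encoding="big5") as f:
        recovered = f.read()

print(utf8_ok)            # False
print(recovered == text)  # True

# Assumed PyCantonese equivalent for such a file (not run here):
# corpus = pc.read_chat(path, encoding="big5")
```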
<p>If you are aware of other Cantonese corpora that could be incorporated into
PyCantonese for open access,
or if you are the owner of a Cantonese corpus and
would like to make it accessible
through PyCantonese, please contact <a class="reference external" href="http://jacksonllee.com">Jackson Lee</a>.</p>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="footer" role="contentinfo">
        &#169; Copyright 2014-2018, Jackson L. Lee | Documentation last updated on June 30, 2018.
      Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.7.5.
    </div>
  </body>
</html>